{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Reinforcement Learning\n",
    "\n",
    "This Jupyter notebook acts as supporting material for **Chapter 21 Reinforcement Learning** of the book* Artificial Intelligence: A Modern Approach*. This notebook makes use of the implementations in `rl.py` module. We also make use of implementation of MDPs in the `mdp.py` module to test our agents. It might be helpful if you have already gone through the Jupyter notebook dealing with Markov decision process. Let us import everything from the `rl` module. It might be helpful to view the source of some of our implementations. Please refer to the Introductory Jupyter notebook for more details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from rl import *"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## CONTENTS\n",
    "\n",
    "* Overview\n",
    "* Passive Reinforcement Learning\n",
    "    - Direct Utility Estimation\n",
    "    - Adaptive Dynamic Programming\n",
    "    - Temporal-Difference Agent\n",
    "* Active Reinforcement Learning\n",
    "    - Q learning"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## OVERVIEW\n",
    "\n",
    "Before we start playing with the actual implementations let us review a couple of things about RL.\n",
    "\n",
    "1. Reinforcement Learning is concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward. \n",
    "\n",
    "2. Reinforcement learning differs from standard supervised learning in that correct input/output pairs are never presented, nor sub-optimal actions explicitly corrected. Further, there is a focus on on-line performance, which involves finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge).\n",
    "\n",
    "-- Source: [Wikipedia](https://en.wikipedia.org/wiki/Reinforcement_learning)\n",
    "\n",
    "In summary we have a sequence of state action transitions with rewards associated with some states. Our goal is to find the optimal policy $\\pi$ which tells us what action to take in each state."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## PASSIVE REINFORCEMENT LEARNING\n",
    "\n",
    "In passive Reinforcement Learning the agent follows a fixed policy $\\pi$. Passive learning attempts to evaluate the given policy $pi$ - without any knowledge of the Reward function $R(s)$ and the Transition model $P(s'\\ |\\ s, a)$.\n",
    "\n",
    "This is usually done by some method of **utility estimation**. The agent attempts to directly learn the utility of each state that would result from following the policy. Note that at each step, it has to *perceive* the reward and the state - it has no global knowledge of these. Thus, if a certain the entire set of actions offers a very low probability of attaining some state $s_+$ - the agent may never perceive the reward $R(s_+)$.\n",
    "\n",
    "Consider a situation where an agent is given a policy to follow. Thus, at any point it knows only its current state and current reward, and the action it must take next. This action may lead it to more than one state, with different probabilities.\n",
    "\n",
    "For a series of actions given by $\\pi$, the estimated utility $U$:\n",
    "$$U^{\\pi}(s) = E(\\sum_{t=0}^\\inf \\gamma^t R^t(s')$$)\n",
    "Or the expected value of summed discounted rewards until termination.\n",
    "\n",
    "Based on this concept, we discuss three methods of estimating utility:\n",
    "\n",
    "1. **Direct Utility Estimation (DUE)**\n",
    " \n",
    " The first, most naive method of estimating utility comes from the simplest interpretation of the above definition. We construct an agent that follows the policy until it reaches the terminal state. At each step, it logs its current state, reward. Once it reaches the terminal state, it can estimate the utility for each state for *that* iteration, by simply summing the discounted rewards from that state to the terminal one.\n",
    "\n",
    " It can now run this 'simulation' $n$ times, and calculate the average utility of each state. If a state occurs more than once in a simulation, both its utility values are counted separately.\n",
    " \n",
    " Note that this method may be prohibitively slow for very large statespaces. Besides, **it pays no attention to the transition probability $P(s'\\ |\\ s, a)$.** It misses out on information that it is capable of collecting (say, by recording the number of times an action from one state led to another state). The next method addresses this issue.\n",
    " \n",
    "2. **Adaptive Dynamic Programming (ADP)**\n",
    " \n",
    " This method makes use of knowledge of the past state $s$, the action $a$, and the new perceived state $s'$ to estimate the transition probability $P(s'\\ |\\ s,a)$. It does this by the simple counting of new states resulting from previous states and actions.<br> \n",
    " The program runs through the policy a number of times, keeping track of:\n",
    "    - each occurrence of state $s$ and the policy-recommended action $a$ in $N_{sa}$\n",
    "    - each occurrence of $s'$ resulting from $a$ on $s$ in $N_{s'|sa}$.\n",
    "     \n",
    " It can thus estimate $P(s'\\ |\\ s,a)$ as $N_{s'|sa}/N_{sa}$, which in the limit of infinite trials, will converge to the true value.<br>\n",
    " Using the transition probabilities thus estimated, it can apply `POLICY-EVALUATION` to estimate the utilities $U(s)$ using properties of convergence of the Bellman functions.\n",
    "\n",
    "3. **Temporal-difference learning (TD)**\n",
    " \n",
    " Instead of explicitly building the transition model $P$, the temporal-difference model makes use of the expected closeness between the utilities of two consecutive states $s$ and $s'$.\n",
    " For the transition $s$ to $s'$, the update is written as:\n",
    "$$U^{\\pi}(s) \\leftarrow U^{\\pi}(s) + \\alpha \\left( R(s) + \\gamma U^{\\pi}(s') - U^{\\pi}(s) \\right)$$\n",
    " This model implicitly incorporates the transition probabilities by being weighed for each state by the number of times it is achieved from the current state. Thus, over a number of iterations, it converges similarly to the Bellman equations.\n",
    " The advantage of the TD learning model is its relatively simple computation at each step, rather than having to keep track of various counts.\n",
    " For $n_s$ states and $n_a$ actions the ADP model would have $n_s \\times n_a$ numbers $N_{sa}$ and $n_s^2 \\times n_a$ numbers $N_{s'|sa}$ to keep track of. The TD model must only keep track of a utility $U(s)$ for each state."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Demonstrating Passive agents\n",
    "\n",
    "Passive agents are implemented in `rl.py` as various `Agent-Class`es.\n",
    "\n",
    "To demonstrate these agents, we make use of the `GridMDP` object from the `MDP` module. `sequential_decision_environment` is similar to that used for the `MDP` notebook but has discounting with $\\gamma = 0.9$.\n",
    "\n",
    "The `Agent-Program` can be obtained by creating an instance of the relevant `Agent-Class`. The `__call__` method allows the `Agent-Class` to be called as a function. The class needs to be instantiated with a policy ($\\pi$) and an `MDP` whose utility of states will be estimated."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from mdp import sequential_decision_environment"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `sequential_decision_environment` is a GridMDP object as shown below. The rewards are **+1** and **-1** in the terminal states, and **-0.04** in the rest. <img src=\"files/images/mdp.png\"> Now we define actions and a policy similar to **Fig 21.1** in the book."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Action Directions\n",
    "north = (0, 1)\n",
    "south = (0,-1)\n",
    "west = (-1, 0)\n",
    "east = (1, 0)\n",
    "\n",
    "policy = {\n",
    "    (0, 2): east,  (1, 2): east,  (2, 2): east,   (3, 2): None,\n",
    "    (0, 1): north,                (2, 1): north,  (3, 1): None,\n",
    "    (0, 0): north, (1, 0): west,  (2, 0): west,   (3, 0): west, \n",
    "}\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###  Direction Utility Estimation Agent\n",
    "\n",
    "The `PassiveDEUAgent` class in the `rl` module implements the Agent Program described in **Fig 21.2** of the AIMA Book. `PassiveDEUAgent` sums over rewards to find the estimated utility for each state. It thus requires the running of a number of iterations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "%psource PassiveDUEAgent"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "DUEagent = PassiveDUEAgent(policy, sequential_decision_environment)\n",
    "for i in range(200):\n",
    "    run_single_trial(DUEagent, sequential_decision_environment)\n",
    "    DUEagent.estimate_U()\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The calculated utilities are:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('\\n'.join([str(k)+':'+str(v) for k, v in DUEagent.U.items()]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Adaptive Dynamic Programming Agent\n",
    "\n",
    "The `PassiveADPAgent` class in the `rl` module implements the Agent Program described in **Fig 21.2** of the AIMA Book. `PassiveADPAgent` uses state transition and occurrence counts to estimate $P$, and then $U$. Go through the source below to understand the agent."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "%psource PassiveADPAgent"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We instantiate a `PassiveADPAgent` below with the `GridMDP` shown and train it over 200 iterations. The `rl` module has a simple implementation to simulate iterations. The function is called **run_single_trial**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "ADPagent = PassiveADPAgent(policy, sequential_decision_environment)\n",
    "for i in range(200):\n",
    "    run_single_trial(ADPagent, sequential_decision_environment)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The calculated utilities are:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "print('\\n'.join([str(k)+':'+str(v) for k, v in ADPagent.U.items()]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Passive Temporal Difference Agent\n",
    "\n",
    "`PassiveTDAgent` uses temporal differences to learn utility estimates. We learn the difference between the states and backup the values to previous states.  Let us look into the source before we see some usage examples."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "%psource PassiveTDAgent"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In creating the `TDAgent`, we use the **same learning rate** $\\alpha$ as given in the footnote of the book on **page 837**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "TDagent = PassiveTDAgent(policy, sequential_decision_environment, alpha = lambda n: 60./(59+n))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we run **200 trials** for the agent to estimate Utilities."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "for i in range(200):\n",
    "    run_single_trial(TDagent,sequential_decision_environment)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The calculated utilities are:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('\\n'.join([str(k)+':'+str(v) for k, v in TDagent.U.items()]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Comparison with value iteration method\n",
    "\n",
    "We can also compare the utility estimates learned by our agent to those obtained via **value iteration**.\n",
    "\n",
    "**Note that value iteration has a priori knowledge of the transition table $P$, the rewards $R$, and all the states $s$.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from mdp import value_iteration"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The values calculated by value iteration:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "U_values = value_iteration(sequential_decision_environment)\n",
    "print('\\n'.join([str(k)+':'+str(v) for k, v in U_values.items()]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evolution of utility estimates over iterations\n",
    "\n",
    "We can explore how these estimates vary with time by using plots similar to **Fig 21.5a**. We will first enable matplotlib using the inline backend. We also define a function to collect the values of utilities at each iteration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "def graph_utility_estimates(agent_program, mdp, no_of_iterations, states_to_graph):\n",
    "    graphs = {state:[] for state in states_to_graph}\n",
    "    for iteration in range(1,no_of_iterations+1):\n",
    "        run_single_trial(agent_program, mdp)\n",
    "        for state in states_to_graph:\n",
    "            graphs[state].append((iteration, agent_program.U[state]))\n",
    "    for state, value in graphs.items():\n",
    "        state_x, state_y = zip(*value)\n",
    "        plt.plot(state_x, state_y, label=str(state))\n",
    "    plt.ylim([0,1.2])\n",
    "    plt.legend(loc='lower right')\n",
    "    plt.xlabel('Iterations')\n",
    "    plt.ylabel('U')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here is a plot of state $(2,2)$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "agent = PassiveTDAgent(policy, sequential_decision_environment, alpha=lambda n: 60./(59+n))\n",
    "graph_utility_estimates(agent, sequential_decision_environment, 500, [(2,2)])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It is also possible to plot multiple states on the same plot. As expected, the utility of the finite state $(3,2)$ stays constant and is equal to $R((3,2)) = 1$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "graph_utility_estimates(agent, sequential_decision_environment, 500, [(2,2), (3,2)])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## ACTIVE REINFORCEMENT LEARNING\n",
    "\n",
    "Unlike Passive Reinforcement Learning in Active Reinforcement Learning we are not bound by a policy pi and we need to select our actions. In other words the agent needs to learn an optimal policy. The fundamental tradeoff the agent needs to face is that of exploration vs. exploitation. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### QLearning Agent\n",
    "\n",
    "The QLearningAgent class in the rl module implements the Agent Program described in **Fig 21.8** of the AIMA Book. In Q-Learning the agent learns an action-value function Q which gives the utility of taking a given action in a particular state. Q-Learning does not required a transition model and hence is a model free method. Let us look into the source before we see some usage examples."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "%psource QLearningAgent"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The Agent Program can be obtained by creating the instance of the class by passing the appropriate parameters. Because of the __ call __ method the object that is created behaves like a callable and returns an appropriate action as most Agent Programs do. To instantiate the object we need a mdp similar to the PassiveTDAgent.\n",
    "\n",
    " Let us use the same GridMDP object we used above. **Figure 17.1 (sequential_decision_environment)** is similar to **Figure 21.1** but has some discounting as **gamma = 0.9**. The class also implements an exploration function **f** which returns fixed **Rplus** until agent has visited state, action **Ne** number of times. This is the same as the one defined on page **842** of the book. The method **actions_in_state** returns actions possible in given state. It is useful when applying max and argmax operations."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let us create our object now. We also use the **same alpha** as given in the footnote of the book on **page 837**. We use **Rplus = 2** and **Ne = 5** as defined on page 843. **Fig 21.7**  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "q_agent = QLearningAgent(sequential_decision_environment, Ne=5, Rplus=2, \n",
    "                         alpha=lambda n: 60./(59+n))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now to try out the q_agent we make use of the **run_single_trial** function in rl.py (which was also used above). Let us use **200** iterations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "for i in range(200):\n",
    "    run_single_trial(q_agent,sequential_decision_environment)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let us see the Q Values. The keys are state-action pairs. Where different actions correspond according to:\n",
    "\n",
    "north = (0, 1)\n",
    "south = (0,-1)\n",
    "west = (-1, 0)\n",
    "east = (1, 0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "q_agent.Q"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The Utility **U** of each state is related to **Q** by the following equation.\n",
    "\n",
    "**U (s) = max <sub>a</sub> Q(s, a)**\n",
    "\n",
    "Let us convert the Q Values above into U estimates.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "U = defaultdict(lambda: -1000.) # Very Large Negative Value for Comparison see below.\n",
    "for state_action, value in q_agent.Q.items():\n",
    "    state, action = state_action\n",
    "    if U[state] < value:\n",
    "                U[state] = value"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "U"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let us finally compare these estimates to value_iteration results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(value_iteration(sequential_decision_environment))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
